80 research outputs found
"Is a picture of a bird a bird": Policy recommendations for dealing with ambiguity in machine vision models
Many questions that we ask about the world do not have a single clear answer,
yet typical human annotation set-ups in machine learning assume there must be a
single ground truth label for all examples in every task. The divergence
between reality and practice is stark, especially in cases with inherent
ambiguity and where the range of different subjective judgments is wide. Here,
we examine the implications of subjective human judgments in the behavioral
task of labeling images used to train machine vision models. We identify three
primary sources of ambiguity arising from (i) depictions of labels in the
images, (ii) raters' backgrounds, and (iii) the task definition. On the basis
of the empirical results, we suggest best practices for handling label
ambiguity in machine learning datasets.
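One concrete way to act on this kind of recommendation is to keep the full distribution of rater labels instead of collapsing them to a single ground truth, and to flag high-disagreement examples. The sketch below is our illustration, not the paper's method: it measures ambiguity as the Shannon entropy of the rater label distribution.

```python
from collections import Counter
import math

def label_distribution(ratings):
    """Normalize a list of per-rater labels into a label distribution."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: n / total for label, n in counts.items()}

def label_entropy(ratings):
    """Shannon entropy (bits) of the rater label distribution:
    0.0 means perfect agreement; higher values flag ambiguous examples."""
    return -sum(p * math.log2(p) for p in label_distribution(ratings).values())

# Five hypothetical raters label an image depicting a painting of a bird.
ratings = ["bird", "bird", "painting", "bird", "painting"]
print(label_distribution(ratings))  # {'bird': 0.6, 'painting': 0.4}
print(round(label_entropy(ratings), 3))  # 0.971
```

A dataset pipeline could then route examples above an entropy threshold to adjudication, or ship the distribution itself as a soft label.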
Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs
Large language models (LLMs) have achieved widespread success on a variety of
in-context few-shot tasks, but this success is typically evaluated via
correctness rather than consistency. We argue that self-consistency is an
important criterion for valid multi-step reasoning in tasks where the solution
is composed of the answers to multiple sub-steps. We propose two types of
self-consistency that are particularly important for multi-step reasoning --
hypothetical consistency (a model's ability to predict what its output would be
in a hypothetical other context) and compositional consistency (consistency of
a model's final outputs when intermediate sub-steps are replaced with the
model's outputs for those steps). We demonstrate that multiple variants of the
GPT-3/-4 models exhibit poor consistency rates across both types of consistency
on a variety of tasks.
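The compositional-consistency check described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation harness: `ask_model` is a hypothetical stand-in for any prompt-to-answer LLM call, and the substitution scheme (prepending the model's own sub-step answers to the final prompt) is only one way to compose.

```python
def compositional_consistency(ask_model, question, substeps):
    """Return True when answering `question` directly matches answering it
    after substituting the model's own sub-step answers into the prompt."""
    # Path 1: answer the full question directly.
    direct = ask_model(question)

    # Path 2: answer each sub-step, then re-ask the final question with
    # the model's own intermediate answers prepended as context.
    intermediate = {step: ask_model(step) for step in substeps}
    context = " ".join(f"{step} {answer}." for step, answer in intermediate.items())
    composed = ask_model(f"{context} {question}")

    # A compositionally consistent model gives the same final answer
    # along both paths.
    return direct.strip() == composed.strip()

# Toy "models": a constant answerer is trivially consistent, while one
# whose answer depends on prompt length is not.
constant_model = lambda prompt: "4"
length_model = lambda prompt: str(len(prompt))
print(compositional_consistency(constant_model, "What is 2+2?", ["What is 1+1?"]))  # True
print(compositional_consistency(length_model, "What is 2+2?", ["What is 1+1?"]))    # False
```

Hypothetical consistency would be checked analogously, comparing the model's prediction of its own output in another context against that actual output.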
BLiMP: The Benchmark of Linguistic Minimal Pairs for English
We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP),
a challenge set for evaluating what language models (LMs) know about major
grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each
containing 1000 minimal pairs isolating specific contrasts in syntax,
morphology, or semantics. The data is automatically generated according to
expert-crafted grammars, and aggregate human agreement with the labels is
96.4%. We use it to evaluate n-gram, LSTM, and Transformer (GPT-2 and
Transformer-XL) LMs. We find that state-of-the-art models identify
morphological contrasts reliably, but they struggle with semantic restrictions
on the distribution of quantifiers and negative polarity items and subtle
syntactic phenomena such as extraction islands. Comment: To appear in TACL.
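Minimal-pair evaluation of this kind reduces to comparing sentence probabilities: the LM is credited when it assigns higher probability to the grammatical member of each pair. The sketch below assumes a hypothetical `sentence_log_prob` function (any callable returning the total log-probability an LM assigns to a sentence); the pair shown is illustrative of BLiMP's subject-verb agreement contrasts.

```python
def minimal_pair_accuracy(sentence_log_prob, pairs):
    """Fraction of (grammatical, ungrammatical) minimal pairs for which
    the LM assigns higher log-probability to the grammatical sentence."""
    correct = sum(
        sentence_log_prob(good) > sentence_log_prob(bad) for good, bad in pairs
    )
    return correct / len(pairs)

# Toy scorer standing in for a real LM: a lookup of precomputed scores.
toy_scores = {
    "These casseroles disgust Kayla.": -14.2,   # grammatical, scored higher
    "These casseroles disgusts Kayla.": -16.8,  # agreement violation
}
pairs = [("These casseroles disgust Kayla.", "These casseroles disgusts Kayla.")]
print(minimal_pair_accuracy(toy_scores.__getitem__, pairs))  # 1.0
```

Because the two sentences differ only in the contrast under test, no task-specific training or prompting is needed; this is what makes the benchmark applicable to any model that scores strings.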
Evolutionary Reconstructions of the Transferrin Receptor of Caniforms Supports Canine Parvovirus Being a Re-emerged and Not a Novel Pathogen in Dogs
Parvoviruses exploit transferrin receptor type-1 (TfR) for cellular entry in carnivores, and specific interactions are key to control of host range. We show that several key mutations acquired by TfR during the evolution of Caniforms (dogs and related species) modified the interactions with parvovirus capsids by reducing the level of binding. These data, along with signatures of positive selection in the TFRC gene, are consistent with an evolutionary arms race between the TfR of the Caniform clade and parvoviruses. In addition to the amino acid changes that modify binding, we found that a glycosylation-site mutation in the dog TfR, which conferred resistance to the carnivore parvoviruses circulating before about 1975, predates the speciation of coyotes and dogs. The closely related black-backed jackal has a TfR similar to that of the common ancestor and lacks the glycosylation site; reconstructing this mutation in the jackal TfR demonstrates the potency of that site in blocking binding and infection, and explains the resistance of dogs until recent times. This alters our understanding of this well-known example of viral emergence by indicating that canine parvovirus emergence likely resulted from the re-adaptation of a parvovirus to the resistant receptor of a former host
DataPerf: Benchmarks for Data-Centric AI Development
Machine learning research has long focused on models rather than datasets,
and prominent datasets are used for common ML tasks without regard to the
breadth, difficulty, and faithfulness of the underlying problems. Neglecting
the fundamental importance of data has given rise to inaccuracy, bias, and
fragility in real-world applications, and research is hindered by saturation
across existing dataset benchmarks. In response, we present DataPerf, a
community-led benchmark suite for evaluating ML datasets and data-centric
algorithms. We aim to foster innovation in data-centric AI through competition,
comparability, and reproducibility. We enable the ML community to iterate on
datasets, instead of just architectures, and we provide an open, online
platform with multiple rounds of challenges to support this iterative
development. The first iteration of DataPerf contains five benchmarks covering
a wide spectrum of data-centric techniques, tasks, and modalities in vision,
speech, acquisition, debugging, and diffusion prompting, and we support hosting
new contributed benchmarks from the community. The benchmarks, online
evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry. Comment: NeurIPS 2023 Datasets and Benchmarks Track.
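The core idea of iterating on datasets while the model stays fixed can be illustrated with a minimal selection loop. This is a sketch, not DataPerf's actual API or harness: `train_and_score` is a hypothetical stand-in that trains the fixed model on one candidate dataset and returns a validation score.

```python
def select_best_dataset(train_and_score, candidate_datasets):
    """Data-centric iteration in miniature: the model and training recipe
    inside `train_and_score` stay fixed; only the training data varies."""
    scored = [(train_and_score(data), i) for i, data in enumerate(candidate_datasets)]
    best_score, best_index = max(scored)  # highest validation score wins
    return candidate_datasets[best_index], best_score

# Toy stand-in: "training" simply rewards larger cleaned subsets.
candidates = [["a"], ["a", "b"], ["a", "b", "c"]]
best, score = select_best_dataset(len, candidates)
print(best, score)  # ['a', 'b', 'c'] 3
```

A real submission to a data-centric challenge would replace the toy scorer with an actual train-and-evaluate run, keeping the architecture and hyperparameters constant across candidates so that only data quality is measured.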
Theory and description in African Linguistics: Selected papers from the 47th Annual Conference on African Linguistics
The papers in this volume were presented at the 47th Annual Conference on African Linguistics at UC Berkeley in 2016. The papers offer new descriptions of African languages and propose novel theoretical analyses of them. The contributions span topics in phonetics, phonology, syntax, semantics, and pragmatics and reflect the typological and genetic diversity of languages in Africa. Four papers in the volume examine Areal Features and Linguistic Reconstruction in Africa, and were presented at a special workshop on this topic held alongside the general session of ACAL